Transcription to Prediction

Predicting Survival in AML Patients from RNASeq Data using SVM

[redacted]

Introduction

Acute Myeloid Leukemia (AML)

  • A cancer of the blood and bone marrow that affects the myeloid cells.
  • Usually very aggressive, with limited therapeutic options.
  • Only ~20% of patients achieve durable remission.1
  • Prognosis is usually determined by cytogenetic and molecular markers, which are also important factors in choosing an appropriate treatment plan.

Dataset & Objectives

As part of the Beat AML 2.0 program, a dataset consisting of clinical outcomes and genomic and transcriptomic data was collected from a cohort of 805 AML patients.2

Of these, 571 patients have RNASeq data and either survived for at least one year after diagnosis (n = 311) or died within that year (n = 260). We propose a methodology to predict the one-year survival of AML patients using a support vector machine (SVM) model based on their transcriptomic (RNASeq) data.

Methodology

Data Preprocessing and Normalization

  • The raw read counts were normalized to z-scores across the features.
  • The data was then split into training and testing sets with an 80/20 ratio.
  • The RNASeq data was then joined with the clinical data to form the final dataset.
# A tibble: 571 × 51,017
   survived ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457
   <lgl>              <dbl>           <dbl>           <dbl>           <dbl>
 1 TRUE              -0.434          -0.123         -0.526          -0.330 
 2 TRUE              -0.322          -0.123         -1.10            0.900 
 3 TRUE              -0.479          -0.123          0.0255         -0.737 
 4 TRUE              -0.299          -0.123         -0.452          -0.336 
 5 TRUE               0.240          -0.123         -0.992           0.696 
 6 FALSE             -0.434          -0.123         -1.25           -1.22  
 7 TRUE               0.689          -0.123         -0.757          -0.586 
 8 TRUE              -0.389          -0.123         -0.927           0.0561
 9 TRUE              -0.479          -0.123         -1.09            0.582 
10 FALSE             -0.479          -0.123         -0.755          -0.796 
# ℹ 561 more rows
# ℹ 51,012 more variables: ENSG00000000460 <dbl>, ENSG00000000938 <dbl>,
#   ENSG00000000971 <dbl>, ENSG00000001036 <dbl>, ENSG00000001084 <dbl>,
#   ENSG00000001167 <dbl>, ENSG00000001460 <dbl>, ENSG00000001461 <dbl>,
#   ENSG00000001497 <dbl>, ENSG00000001561 <dbl>, ENSG00000001617 <dbl>,
#   ENSG00000001626 <dbl>, ENSG00000001629 <dbl>, ENSG00000001630 <dbl>,
#   ENSG00000001631 <dbl>, ENSG00000002016 <dbl>, ENSG00000002079 <dbl>, …
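The preprocessing steps above can be sketched as follows. This is an illustrative Python/scikit-learn sketch (the original pipeline was presumably built in R); the `counts` and `survived` arrays are synthetic stand-ins for the Beat AML tables:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Beat AML data: rows = patients, columns = genes.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=20.0, size=(571, 100)).astype(float)
survived = rng.integers(0, 2, size=571)

# Z-score each feature (gene) across patients.
mu = counts.mean(axis=0)
sigma = counts.std(axis=0)
sigma[sigma == 0] = 1.0          # guard constant genes against division by zero
z = (counts - mu) / sigma

# 80/20 train/test split, stratified on the survival label.
X_train, X_test, y_train, y_test = train_test_split(
    z, survived, test_size=0.2, stratify=survived, random_state=42
)
```

Stratifying on the label keeps the survived/died proportions similar in both splits, which matters with a roughly balanced but modest-sized cohort.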

Methodology

Feature Selection

  • We used the Boruta algorithm3 to select the most important features.
  • A brief explanation of the Boruta algorithm:
    • It first creates shadow features by randomly permuting the original features.
    • It then fits a random forest model using both the original and shadow features.
    • If a feature scores a higher importance Z-score than the best shadow feature, it is considered important.
    • If a feature repeatedly scores no better than the shadow features, it is considered unimportant and removed from further iterations.
    • The algorithm iterates until every feature is classified as either important or unimportant (or an iteration limit is reached).
  • We ran the Boruta algorithm for 1000 iterations, reducing the feature set from 51,016 to 60 important features.
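The shadow-feature comparison at the heart of Boruta can be sketched in a few lines. This is a simplified Python/scikit-learn illustration (the analysis itself used the R Boruta package); the data, the number of rounds, and the hit threshold are all made up for demonstration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: 8 features, of which only the first 3 carry signal.
rng = np.random.default_rng(0)
n, p = 300, 8
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=n) > 0).astype(int)

def boruta_round(X, y, rng):
    """One Boruta-style iteration: does each real feature beat the best shadow?"""
    shadow = rng.permuted(X, axis=0)            # permute each column independently
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(np.hstack([X, shadow]), y)
    imp = rf.feature_importances_
    real, sh = imp[: X.shape[1]], imp[X.shape[1]:]
    return real > sh.max()                      # "hit" = beats all shadow features

hits = sum(boruta_round(X, y, rng) for _ in range(15))
important = np.flatnonzero(hits >= 10)          # important = hit in most rounds
print(important)
```

The real algorithm additionally uses a statistical test on the hit counts and drops confirmed-unimportant features between iterations, but the permute-compare-count loop is the same idea.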

Methodology

Model Evaluation

  • 80/20 split for training and testing.
  • We care about all 4 quadrants of the confusion matrix!
  • Matthews Correlation Coefficient (MCC) as the primary metric to evaluate the performance of the SVM model.
  • \[ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]
  • MCC ranges from -1 to 1, where 1 is a perfect prediction, 0 is a random prediction, and -1 is a perfect inverse prediction.
  • We also used the ROC curve to visualize some of the results.
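As a concrete check of the formula, MCC can be computed directly from the four confusion-matrix counts. A small Python sketch with made-up counts:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0  # define MCC = 0 when any margin is empty

print(mcc(50, 40, 10, 20))   # moderate positive agreement, ~0.51
print(mcc(30, 30, 0, 0))     # perfect prediction -> 1.0
```

Because all four quadrants appear in the denominator, MCC penalizes a classifier that achieves high accuracy simply by favoring the majority class.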

Methodology

SVM - First Try

  • Short introduction to SVM:
    • A supervised machine learning algorithm that can be used for classification or regression tasks.
    • It maps the data to a higher-dimensional space and finds the hyperplane that best separates the classes.
    • It handles both linear and non-linear classification tasks via the kernel trick.
  • We fit a default SVM model using the linear kernel on either the whole dataset or the 60 important features.
    • Feature selection is effective.

ROC Curve for bare-bones SVM
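The baseline fit described above can be sketched as follows; this is an illustrative Python/scikit-learn version with a synthetic matrix standing in for the 60 Boruta-selected features:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef

# Synthetic stand-in for the 60 selected, z-scored features.
rng = np.random.default_rng(1)
X = rng.normal(size=(571, 60))
y = (X[:, :5].sum(axis=1) + rng.normal(size=571) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = SVC(kernel="linear")           # default C = 1.0, no tuning yet
clf.fit(X_tr, y_tr)
score = matthews_corrcoef(y_te, clf.predict(X_te))
print(score)
```

A default linear kernel like this is the natural first baseline before any tuning of \(c\) or the kernel.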

Methodology

SVM - \(c\)-values & Non-linear Models

We need to tune the \(c\) value for the SVM algorithm.

  • A low \(c\) value allows for a wider margin but more misclassification.
  • A high \(c\) value allows for a narrower margin but less misclassification.

Large c vs low c

In addition to the linear kernel, we also tried two non-linear kernels:

  • The gaussian kernel: \[ K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{\epsilon}\right) \]

    The Gaussian Kernel
  • The polynomial kernel.

    \[ K(x, x') = (\langle x, x' \rangle + C)^d \]

    The Polynomial Kernel
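The two kernel definitions translate directly into code. A minimal Python sketch, with illustrative parameter values (the \(\epsilon\), \(C\), and \(d\) used in the actual tuning are not reproduced here):

```python
import numpy as np

def gaussian_kernel(x, xp, eps):
    """Gaussian (RBF) kernel with bandwidth parameter eps, as defined above."""
    return np.exp(-np.sum((x - xp) ** 2) / eps)

def polynomial_kernel(x, xp, c, d):
    """Polynomial kernel: (<x, x'> + c)^d."""
    return (np.dot(x, xp) + c) ** d

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(gaussian_kernel(a, a, eps=2.0))         # identical points -> 1.0
print(polynomial_kernel(a, b, c=1.0, d=2.0))  # (0 + 1)^2 = 1.0
```

Note the extra hyperparameters this introduces: \(\epsilon\) for the Gaussian kernel and the constant and degree for the polynomial kernel, which is exactly what the tuning flowchart on the next slide branches on.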

Methodology

Model and Hyperparameter Tuning (cont’d)

flowchart LR
   boruta(Boruta Feature Selection) --> svm(SVM)
   svm --> c_values(Choose C Values)
   c_values --> linear_svm(Linear SVM)
   c_values --> gaussian_svm(Gaussian SVM)
   c_values --> polynomial_svm(Polynomial SVM)
   gaussian_svm --> hyper_gaussian(Choose Epsilon)
   polynomial_svm --> hyper_polynomial(Choose Degree and Constant)
   linear_svm ---> model_evaluation(MCC)
   hyper_gaussian --> model_evaluation
   hyper_polynomial --> model_evaluation

Results

SVM - Variable Importance

  • We used a technique called permutation importance to determine the importance of each feature in the SVM model.
# A tibble: 60 × 5
  feature         delta_mcc display_label description                    biotype
  <chr>               <dbl> <chr>         <chr>                          <chr>  
1 ENSG00000198933   -0.109  TBKBP1        TBK1 binding protein 1 [Sourc… protei…
2 ENSG00000226419   -0.0915 SLC16A1-AS1   SLC16A1 antisense RNA 1 [Sour… antise…
3 ENSG00000042832   -0.0915 TG            thyroglobulin [Source:HGNC Sy… protei…
4 ENSG00000165138   -0.0915 ANKS6         ankyrin repeat and sterile al… protei…
5 ENSG00000122378   -0.0740 FAM213A       family with sequence similari… protei…
# ℹ 55 more rows
  • TBKBP1, part of a growth factor signaling axis, has already been proposed in the literature as a potential mediator of tumor growth.4
  • SLC16A1-AS1: multiple studies show it regulates the cell cycle in oral squamous cell carcinoma5, and clinical data suggest it contributes to the progression of hepatocellular carcinoma.6
  • TG, a gene that encodes thyroglobulin, a precursor of thyroid hormones.
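The permutation-importance idea behind the delta_mcc column can be sketched as follows: permute one feature at a time and measure how much the model's MCC drops. This is an illustrative Python/scikit-learn version on toy data, not the actual pipeline:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import matthews_corrcoef

def permutation_importance_mcc(model, X, y, rng):
    """Delta-MCC per feature: permute one column at a time and remeasure MCC.
    A more negative delta means the model relied more on that feature."""
    base = matthews_corrcoef(y, model.predict(X))
    deltas = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        deltas[j] = matthews_corrcoef(y, model.predict(Xp)) - base
    return deltas

# Toy data: only feature 0 is informative.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)
model = SVC(kernel="linear").fit(X, y)
deltas = permutation_importance_mcc(model, X, y, rng)
print(int(np.argmin(deltas)))  # feature 0 should show the largest MCC drop
```

Permuting a column destroys its association with the label while preserving its marginal distribution, so the size of the MCC drop is a model-agnostic importance score.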

Results

SVM - Training the Final Model

  • We can still do more to improve the model (in terms of increasing its predictive power).
  • We have a limited number of samples, so we can conserve them by using 10-fold cross-validation instead of the 80/20 split we’ve been using.
  • A predictor that has seen a given input gets only a 10% vote on it!

ROC curve of the final model
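The 10-fold evaluation can be sketched as follows; again this is an illustrative Python/scikit-learn version with a synthetic stand-in for the selected-feature matrix:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import make_scorer, matthews_corrcoef

# Synthetic stand-in for the 60-feature matrix.
rng = np.random.default_rng(3)
X = rng.normal(size=(571, 60))
y = (X[:, :5].sum(axis=1) + rng.normal(size=571) > 0).astype(int)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(
    SVC(kernel="linear"), X, y,
    cv=cv, scoring=make_scorer(matthews_corrcoef)
)
print(scores.mean())  # average out-of-fold MCC across the 10 folds
```

Every sample is used for training in 9 folds and for evaluation in exactly 1, so the whole cohort contributes to the out-of-fold MCC estimate instead of only the 20% held out in a single split.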

Code Availability, License & References

(1)
Pulte, D.; Jansen, L.; Castro, F. A.; Krilaviciute, A.; Katalinic, A.; Barnes, B.; Ressing, M.; Holleczek, B.; Luttmann, S.; Brenner, H.; Group, for the G. C. S. W. Survival in Patients with Acute Myeloblastic Leukemia in Germany and the United States: Major Differences in Survival in Young Adults. International Journal of Cancer 2016, 139 (6), 1289–1296. https://doi.org/10.1002/ijc.30186.
(2)
Bottomly, D.; Long, N.; Schultz, A. R.; Kurtz, S. E.; Tognon, C. E.; Johnson, K.; Abel, M.; Agarwal, A.; Avaylon, S.; Benton, E.; Blucher, A.; Borate, U.; Braun, T. P.; Brown, J.; Bryant, J.; Burke, R.; Carlos, A.; Chang, B. H.; Cho, H. J.; Christy, S.; Coblentz, C.; Cohen, A. M.; d’Almeida, A.; Cook, R.; Danilov, A.; Dao, K.-H. T.; Degnin, M.; Dibb, J.; Eide, C. A.; English, I.; Hagler, S.; Harrelson, H.; Henson, R.; Ho, H.; Joshi, S. K.; Junio, B.; Kaempf, A.; Kosaka, Y.; Laderas, T.; Lawhead, M.; Lee, H.; Leonard, J. T.; Lin, C.; Lind, E. F.; Liu, S. Q.; Lo, P.; Loriaux, M. M.; Luty, S.; Maxson, J. E.; Macey, T.; Martinez, J.; Minnier, J.; Monteblanco, A.; Mori, M.; Morrow, Q.; Nelson, D.; Ramsdill, J.; Rofelty, A.; Rogers, A.; Romine, K. A.; Ryabinin, P.; Saultz, J. N.; Sampson, D. A.; Savage, S. L.; Schuff, R.; Searles, R.; Smith, R. L.; Spurgeon, S. E.; Sweeney, T.; Swords, R. T.; Thapa, A.; Thiel-Klare, K.; Traer, E.; Wagner, J.; Wilmot, B.; Wolf, J.; Wu, G.; Yates, A.; Zhang, H.; Cogle, C. R.; Collins, R. H.; Deininger, M. W.; Hourigan, C. S.; Jordan, C. T.; Lin, T. L.; Martinez, M. E.; Pallapati, R. R.; Pollyea, D. A.; Pomicter, A. D.; Watts, J. M.; Weir, S. J.; Druker, B. J.; McWeeney, S. K.; Tyner, J. W. Integrative Analysis of Drug Response and Clinical Outcome in Acute Myeloid Leukemia. Cancer Cell 2022, 40 (8), 850–864.e9. https://doi.org/10.1016/j.ccell.2022.07.002.
(3)
Kursa, M. B.; Rudnicki, W. R. Feature Selection with the Boruta Package. Journal of Statistical Software 2010, 36, 1–13. https://doi.org/10.18637/jss.v036.i11.
(4)
Zhu, L.; Li, Y.; Xie, X.; Zhou, X.; Gu, M.; Jie, Z.; Ko, C.-J.; Gao, T.; Hernandez, B. E.; Cheng, X.; Sun, S.-C. TBKBP1 and TBK1 Form a Growth Factor Signaling Axis Mediating Immunosuppression and Tumorigenesis. Nature cell biology 2019, 21 (12), 1604–1614. https://doi.org/10.1038/s41556-019-0429-8.
(5)
Feng, H.; Zhang, X.; Lai, W.; Wang, J. Long Non-Coding RNA SLC16A1-AS1: Its Multiple Tumorigenesis Features and Regulatory Role in Cell Cycle in Oral Squamous Cell Carcinoma. Cell Cycle (Georgetown, Tex.) 2020, 19 (13), 1641–1653. https://doi.org/10.1080/15384101.2020.1762048.
(6)
Duan, C. LncRNA SLC16A1-AS1 Contributes to the Progression of Hepatocellular Carcinoma Cells by Modulating miR-411/MITD1 Axis. Journal of Clinical Laboratory Analysis 2022, 36 (4), e24344. https://doi.org/10.1002/jcla.24344.